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1. Introduction 



For simplicity, let us describe the problem in two dimensions. Assume that 
we have a physical variable Z{x ,y) in some two-dimensional region. Thus Z (.v ,y ) may 
be ozone concentrations in the San Francisco Bay Area region. Assume also that this 
random field Z(x,y) is measured at fixed points Xj ,y/ ,/=l,2,...n . The problem consists of 
estimating Z (a' ,y ) at arbitrary points in the region from the given data. 

Nearest neighbor regression (NNR) functions are useful for estimating the 
value of Z(A,y). Nadaraya (1964) and Watson (1964) first independently proposed a 
NNR method which behaves well even when only little information on the distribution of 
the measured data points is known. Stone (1980) showed that non-parametric regression 
estimators of this type have uniform rate of convergence. Stute (1984) showed that if the 
£[Z^] is finite, these estimates are asymptotically nonnal. The same year Cheng (1984) 
showed that if the noise about the true regression has finite variance these estimates are 
uniformly consistent. Stute (1984) also showed that these estimates are more efficient 
than kernel estimators when there are a limited number observations in the neighborhood 
of the estimated point. 

Air Quality is monitored in the United States by monitoring apparatus of 
state, local, and federal networks. From 1974 to 1976, 5777 monitoring stations existed. 
These air quality data consist of discretely measured points which are not on a regular 
grid. Details regarding these data are given in (Johnson, 1983). Section 2 introduces 
NNR as it is applied to these data. Cross-validation is used for choosing the smoothing 
parameter. Section 3 introduces a concept called local variance and its relationship to the 
necessary amount of smoothing for a set of data. A measure of local viu'iance as a tunc- 
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tion of window size is also introduced in this section. Section 5 and 6 finds the window 



size for which mean local variance is minimized for three states and two pollutants. The 
window size is then related to the smoothing parameter. The choice of smoothing 
parameter given by cross-validation is compared to that given by the minimum mean lo- 
cal variance method. 

2. The Nearest Neighbor Regression Method 



pie, the value of Z,- is the measured pollutant level at latitude Xi and longitude y,- . Let 
{Xp\p ) be the point of interest of a neighborhood P. Z,- ’s may or may not be located in P. 
Define d,- as the euclidean distance of (.r,- ,y; ) to {Xp ,yp ). If is large we want Z,- to be 
less influential in the estimate of Zp than if it were small. Let 
m{Xpyp) = E{Zp IXp ,yp ^ i,Z 2 , ' ' Z„) . We are required to construct an estimate of 
m (Xp ,yp ). The proposed estimate is 



where X(d, ) is based on an appropriate weight function and the X(d, ) sum to one. More 
specifically, 



Let Z; be the information from data points positioned at (x,- ,y,- ). For 



m 



(Xp ,yp) = '^Mdi)Z, 
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where Qd{di) = . Since some stations are more active than others; if pi is the 

percentage of time the station located at (.v,- ) was active, then we use 

(D(£/,- )= Pi e 

The weights are dependent on distance such that they are relatively large for 
measurements within a relatively small radius of the estimated point. The size of do 
determines how relatively large the weights are. Weights are large for large do and small 
for small do- As do increases the amount of smoothing increases. Figure 1. gives shows 
the shape of the weight functions for constant /?,• . Each curve corresponds to a different 
do which covers 1, 2, 5, 10, 20, and 50 km. Note that the flattest curve in Figure 1. is for 
do=50km and thus gives the greatest amount of smoothing; while the steepest curve is 
for do=\km and gives the least amount of smoothing. 

These weights were used in the Populations at Risk to Environmental Pollu- 
tion (PAREP) project for estimatmg pollution concentration at the population centroid in 
various geographic areas. According to Merrill (1982), these weights were chosen for 
three reasons "(i) the estimated function would be smooth in the vicinity of the estimated 
points; (ii) the estimated function need not pass through aU measured points; (iii) the area 
integral of the estimated function should be finite, so that distant points can be ignored in 
the calculation." 

To avoid producing estimates from poorly monitored stations, some con- 
straints in estimation are used. The constraints are as follows. Let (J^(i),y(i)) denote the 
nearest data point to (Xp ,yp ). It follows that tf(i) is the minimum distance of a data point 
to {Xp ,y>p ). 



If d(i) > 3do exclude (.t(i),y(i)) and conclude that Zp is inestimable. 
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If (i(i) < 3d(, then all Z, within 4do will be used to estimate Zp . 

The 4c/(, criterion has been constructed to avoid sharp discontinuities at the midpoint 
between predicted points which are exactly 6d„ apart. This is especially important for 
estimating a surface for which this method may also be used. 

One drawback to this method is that these estimates are easily biased to clus- 
ters of data points. If the point of estimation is equidistant from 10 clustered points on 
one side and 1 point on the other side, the estimate using this procedure will be dominat- 
ed by the 10 clustered points when the information from the lone point on one side may 
be more valuable since it is the only measured point on that side. A weight which is in- 
versely proportional to the number of points within a specific distance of the observed 
point could be added to account for the clustering. Thus, points in clusters have less 
influence than those lone points on the estimate. Regardless of whether the weighting is 
by distance or cluster presence a smoothing parameter to detennine the amount of 
smoothing would be included and thus methods for choosing the appropriate smoother 
are still applicable. 

2.2 Cross-Validation to Determine the Amount of Smoothing 

To choose the "optimum" value of , a pollution estimate for each station 
location was produced from observations at other nearby stations excluding the station’s 
own observed value. This estimate is then compared to the actual observed value at the 
station location of the estimate for various d^, ’s. By varying d „ , we can choose those es- 
timates which minimize the prediction error. This method is called cross-validation and 
is nearly identical to the application in Waliba and Wold (1975). However, they used 
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only one measure of prediction error and in this paper four measures are considered. Tire 
idea to leave out one data point out at a time and see how well the various estimates fit 
the observed values is described fully in Stone (1974) or Geisser (1975). 

Prediction error is defined as the cross-validatory estimate minus the ob- 
served value. Wahba and Wold (1975) use the mean of the squared prediction errors and 
call this the cross-validation mean square error tecimique. Tliis technique is used to esti- 
mate from the data the appropriate degree of smoothing. In this paper, several composite 
measures of prediction error are used. 

Since any composite function of the prediction errors is heavily biased by 
outliers, the technique should be used vigilantly. Although, the possibility of this bias is 
not explored here, many points are estimated and compared so that the influence of 1 or 2 
outliers should not be substantial. 

The composite prediction errors were calculated as follows. Let 

z,- = logarithm of the pollutant concentration at location {Xi ). 
m* = estimate of Z (x; ,y,- ) from surrounding data points. 
n = the number of stations that could be estimated. 

Pi = the percentage of time stations i was active. 

The concentrations are in geometric means which are assumed to have a log- 
normal distribution. Thus, the logarithm of these values is distributed normally for large 
n. Therefore when observed or estimated values are discussed they are in units of the 
logarithm of the geometric mean concentration. 

The reliability of this estimate is to some extent a function of the station ac- 
tivity as well as the estimation procedure. For these error functions to reflect the aclequa- 
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cy of the estimation procedure and not the station activity they are weighted according to 
the percentage of time active. 

The sample mean value of the pollutant level, Z is 




Tlie sample variance of all the values of pollutant level, Z is 

5.2 = ^i'(z,-F)2. 

n 

Tlie weighted variance of these values is 
±PiiZi-2)^ 

S,l = 






The following error measures have been considered. 



(MSE) Mean Squared Error 



'(z,— m*)2 



MSE = 



(WMSE) Weighted Mean Squared Error 



WMSE = 



, 11 '’- 



(RMSE) Ratio of Mean Squared Errors 

RMSE = ^|^(zi-m;)2; 
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(WRJMSE) Weighted Ratio of Mean Squared Errors 

WRMSE = 



(MAE) Mean Absolute Error, Percent 



MAE = -X= ^ \zi—m*\ X 100; 
nz 



(WMAE) Weighted Mean Absolute Error, Percent 



1 



WMAE = — Tpi \zi-m,*\ X 100; 






(MPE) Mean Percent Error 



MPE = 



(WMPE) Weighted Mean Percent Eiror 



WMPE = 



_ ^ Pi \zj-m*\ 





X 


>r 


nzi 






^ Pi 


1 




Zi 


f — 1 





X 100 



.Pi 



These formulas are similar to those used by Breiman (1977) for comparing 
kernel and Parzen multivariate density estimation techniques. Wahba and Wold (1975) 
use mean squared error in their presentation. 
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2.3 Stnoothing Sulfur Dioxide and Suspended Particulate Data for Ohio State By Cross- 
Validation 



Sulfur dioxide and suspended particulate data for the state of Ohio were used 
to illustrate the selection of an optimal d„ by cross-validation. From 1974 through 1976 
there were 185 active stations for sulfur dioxide and 398 active stations for total suspend- 
ed particulate. For dg= {, 2, 5, 10, 20 and 50 km, the NNR method was used to predict 
the observed value at the location of each of the active stations by leaving out the 
station’s own value. The difference between the prediction and the observed value is the 
prediction error. We would like to choose the with the smallest composite prediction 
error over all observed values. 

The eight composite measures of error (explained in section 2.2) were com- 
puted for total suspended particulate and sulfur dioxide data in Ohio. Table 1. shows the 
value of each function for suspended particulate and different values of do . The value of 
do which minimizes the loss for the the majority of the unweighted error functions is 5 
kilometers. The RSE is the only one which is not consistent with this choice of an op- 
timal do . Whereas, the weighted error functions consistently choose do equal to 2 km. 
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TABLE 1. Error Functions by (km) for Suspended Particulate in Ohio State. 



i/o(km) 


MSE 


RSE 


MAE 


MPE 


WMSE 


WRSE 


WMAE 


WMPE 


1 


.07562 


*.500 


4.9 


4.9 


.0633 


.425 


4.5 


4.4 


2 


.06907 


.539 


4.8 


4.9 


*.0519 


*.404 


*4.2 


*4.1 


5 


*.06586 


.621 


*4.8 


*4.0 


.0531 


.477 


4.3 


4.2 


10 


.06945 


.696 


4.9 


4.9 


.0565 


.533 


4.4 


4.3 


20 


.08111 


.836 


5.3 


7.4 


.0673 


.631 


4.7 


4.6 


50 


.09337 


.963 


5.6 


5.7 


.0836 


.803 


5.7 


5.2 



^indicates the minimum value and thus corresponds to the optimum d^, . 



Table 2. is the analogue to Table 1. for sulfur dioxide data. Note that the 
both weighted and unweighted RSE is greater than one for all . Tliis means that the 
sum of squared error of the estimate derived from the NNR method is greater than the 
sum of squared error using the overall mean. In other words, 

Since the estimate contains only the ’nearest’ points and the mean contains all the points 
in the region, this could arise from measuring problems such as a large measurement 
error or biased measurements. It could also be that the NNR method is inherently wrong 
for estimating sulfur dioxide. Unlike suspended particulate, sulfur dioxide is a specific 
chemical compound which may diffuse or transform to other compounds in the atmo- 
sphere. Therefore SO 2 concentrations may vary more in a smaller neighborhood (i.e., 
have higher local variance). Suspended Particulate includes many compounds of a certain 
size (eg. dust) which may not be as apt to change in concentration locally. 
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TABLE 2. Error Functions by do for Sulfur Dioxide in Ohio State. 



do (km) 


MSE 


RSE 


MAE 


MPE 


WMSE 


WRSE 


WMAE 


WMPE 


1 


4.34 


4.42 


57.0 


54.9 


3.49 


3.91 


46.6 


43.1 


2 


2.52 


*3.56 


39.9 


38.2 


1.69 


*2.42 


*27.7 


*25.8 


5 


2.17 


3.98 


38.7 


38.1 


1.63 


2.82 


28.9 


27.1 


10 


2.11 


4.21 


39.8 


39.9 


1.60 


2.93 


30.6 


29.5 


20 


2.00 


4.32 


39.7 


39.8 


1.56 


2.96 


31.1 


29.8 


50 


*1.80 


3.95 


*38.2 


*37.5 


*1.38 


2.68 


*37.5 


29.7 



* indicates the minimum value and thus corresponds to the optimum 



3.1 Determining the Appropriate Smoother by Estimating Local Variance 



The appropriate amount of smoothing should be such that insignificant vari- 
ance among adjacent stations is smoothed out; and yet significant trends over larger geo- 
graphic areas are preserved. Tlius, perhaps a more direct method tlian cross-validation 
for finding the optimal d„ or a range of optimal do's is to construct a measure of the 
mean local variance (MLV) as a function of do for a specific area and find the do wliich 
minimizes it. That is, in the NNR method the do which is chosen should be approxi- 
mately that do , corresponding to a distance from each point, for which the point values 
within this distance are, on the average, the most homogeneous. This section explores a 
method for measuring mean local variance as a function of distance. The relationship 
between do and distance is explained below. 

Since the weights are proportional to 

(o(d,- ) = e 

points which are further away relative to do get less weight than points close to the 
estimated point. Therefore, for a particular spatial distribution of points. d„ can be 
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thought of as a distance at which some multiple of it makes the weights so small that the 
station’s infonnation receives insignificant weight. Furthermore, only points within four 
times do are used in the average; yet the points that are a distance of four times do give 
an absolute weight of 

(H{Ado ) = e-8 = .0003 

which can be a small or large relative weight. Since the weights in this method are rela- 
tive weights, a point which is far away, say 2.9 do , can have a weight=l .0 if it is the only 
point within 3xdo . Therefore choosing a relationship between distance and do for this 
weighting scheme depends on the particular data set and its precision. It is suggested 
here to estimate the mean local variance for a particular data set as a function of window 
size and choose the appropriate window size on this basis. The relationship between 
window size and do is positive but will vary among different data sets. 

Local variance is defined as the variance in the data within a small neighbor- 
hood. Local variance is measured in the following manner. A circular window is drawn 
around every point, defined by the station location, in the entire geographic area of 
interest. Thus, for every point there is an associated circular window for which it is the 
center. The radius of each window is chosen as a function of do and is allowed to vary 
from 0 to encircle the entire area. For a large enough radius there will be other points 
besides the center point contained inside these circular windows. For each point we can 
calculate a mean and a variance using the points which lie in its associated circular win- 
dow for a specified radius r . For example, denoting as the number of points within 
Zi ’s associated window and Zj as one of those points contained in the window where j 
ranges from 1 to k , , the window associated with z,- has the following local mean and vari- 



ance. 
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Local Mean: 




Local Variance: 

^ {Zj-Zi)^ 

To calculate the local variance of the entire region for a particular r , an average of the 
local variances is taken and is called the mean local variance. Local variance as a func- 
tion of distance is obtained by varying the size of the circular window. 

First, let us decide how to weight the local variances to find the mean local 
variance for a particular data set. Since a point may be in more than one window we 
should weight the local variances so that each point contributes a weight of 1 to the aver- 
age local variance for the entire region. Each window can be thought of as having ki—\ 
"degrees of freedom" where k( is the number of points in each circular window. If each 
window were independent of every other window (i.e. non-overlapping groups of points) 
then the weights would be 1 and mean local variance could be measured by the 
unweighted average of the local variances. But since there is apt to be overlap, each 

squared difference associated with each point is given a weight of — where n, = the 

number of circular windows containing point (x,- ,y,- ). With this weighting, each point 
contributes a weight of 1 to the mean local variance of the region. 

In order to calculate mean local variance we need a denominator which 
expresses the number of independent data points in the sum of local variances. We wUl 
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call this the "degrees of freedom". The "degrees of freedom" should take account of both 
the total number of data points and the amount of overlap in the circular windows. In 

fact, if each point contributes then the number of non-independent temis in the 



numerator is 




— . Therefore we have chosen the denominator to be iV— 
ni 




— because 
fU 



is the number of non-independent terms in the numerator. This gives the expected 



answer in a few simple cases which are given below. 

From the above discussion, we have constructed the following measure. 



Mean Local Variance (MLV) 







where 

= the total number of points in the region of interest. 

Zi = the value of pollution at the point ,v,- ). 

Tj = the mean value corresponding to the circular window associated with zj . 

kj = the number of points enclosed by zj ’s circular window of which 2 ,- is one of them. 



Let us examine this formula in light of a few examples. 



CASE 1. Every point is in only one circular window, hi this case /i,- = 1 and thus the 
denominator of MLV is zero. Since the radius is so small, no local variances can be 
measured. MLV is undefined in this case. 
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CASE 2. The radius is large enough that every point lies in every other point’s circular 
window. Tlien/i;=A and kj=N and 



which is simply the sample variance of the points for the entire region. In other words, 
MLV approaches the global sampling variance as the radius increases. 

CASE 3. Suppose there are n,- completely overlapping circular windows. In other 
words, there are m discrete clusters of points. In this case each cluster has its own mean 
and sum of squares associated with it. Therefore each point has within a particular clus- 
ter the same mean, zy and sum of squares associated with it as any other point in its clus- 
ter. In this case /i,- = kj , where kj is the number of points in the jth cluster, and 



This is similar to the within mean square in one way analysis of variance (Scheffe’, 
1959). 



CASE 3a. As a specific case of m discrete clusters, suppose each point has only one 
other point in its circular window. So there are a series of two overlapping windows 
which do not contain any other points. In this case rii = 2 for all i and 
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This is case 3 with two points in each cluster. This generalises to the case where every 
point is in k windows which do not overlap, then 



MLV = 




Since every circular window’s centered point is in every other point’s circle 
in the above examples of geographically clustered points, kj = /i, . For any particular 
{zi Zi may be contained in Hi different circular windows while zj may be con- 

tained in kj and tlius n, and kj are not necessarily equal. In other words, there may be 
windows that r,- is contained in but aU the points in those windows are not necessarily 
contained in z,- ’s window. 

Mean Local Variance is similar conceptually to the variogram function 
described in Joumel and Huigbregts (1978). The variogram describes variability 
between two quantities separated by a distance h; while mean local variance describes 
the variability in the data within a radius h. Like the variogram, to estimate mean local 
variance from the realizations of Z {x ,y ), it must be assumed that the variogram function 
depends only on h and not on the location of Z(x,y). This implies second-order sta- 
tionarity which is described in detail in Joumel and Huigbregts (1978). Although the 
variogram function may also be useful for finding the appropriate smoother or window 
size in NNR, this is not explored in this paper. However, we wiU explore the use of 
mean local variance for this purpose and the method using the variogram is analogous. 

Mean local variance is relevant to the choice of the smoothing parameter in 
the NNR method described earlier. If is to be large, then the minimum mean local 
variance should occur at a larger radius so that relatively large weight is given to points 
relatively far from the point of interest. In other words, the homogeneity of points 
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encompasses, on the average, a larger area for larger dg . Likewise, if dg is chosen to be 
small, points far away will be given relatively small weight and therefore the minimum 
mean local variance should occur at a smaller circle radius. 

Ohio is chosen as an example so that we can compare our minimum mean 
local variance choice of dg with our cross-validatory choice of dg in the previous section. 
Mean local variance is calculated for various radii. For each dg, two associated radii 
were chosen. One was Axdg which is the saine size that the NNR method uses in select- 
ing stations to compute its estimate of a point. The other radius is chosen such that the 
to(d, ) are equal to .01 at the edge die circular window. In Figures 2. and 3. the log 
(base e) mean local variance is plotted against the radius of the circular windows for 
Suspended Particulate and Sulfur Dioxide respectively. Logarithms are taken to magnify 
the lower values of the curve. In Table 3. the actual values calculated for Ohio are given. 
For comparison with the cross-validatory choices of the previous section, we chose win- 
dows corresponding xo dg = 1, 2, 5, 10, 20, and 50 km for calculating MLV. The radius 
which gives the most homogeneous window size corresponds to a =2 km by this 
method. These results are the same as those given by cross-validation using weighted 
loss functions and similar to that obtained using the unweighted loss functions (Table 1.). 
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TABLE 3. Mean Local Variance (MLV) as a function of do for Suspended Particulate 
and Sulfur Dioxide in the State of Ohio. 







Suspended Particulate 


Sulfur Dioxide 


do (km) 


radius (km) 


MLV 


^ .I)''; 


MLV 




1 


3.03 


.0583 


141.80 


.2129 


50.37 


1 


4.00 


.0573 


173.28 


.1934 


66.70 


2 


4.29 


*.0562 


181.37 


.1984 


71.00 


2 


8.00 


.0597 


240.35 


.1930 


101.46 


5 


10.74 


.0614 


264.68 


*.1887 


113.73 


5 


20.00 


.0662 


326.96 


.1989 


135.56 


10 


21.47 


.0671 


334.21 


.2008 


137.58 


10 


40.00 


.0819 


372.48 


.2526 


162.65 


20 


42.94 


.0837 


375.20 


.2644 


164.31 


20 


80.00 


.0927 


390.68 


.3166 


177.30 


50 


107.36 


.0945 


393.21 


.3468 


180.41 


50 


200.00 


.0957 


396.03 


.4063 


183.10 


** 


500.00 


.0972 


397.00 


.5069 


184.00 



* Tlie smallest mean local variance, corresponding to the optimal d „ . 

** do is that value which gives a radius large enough to contain all the points in Ohio. 
Notice that "df'=N-l since there are 398 active stations for suspended particulate and 185 
active stations for sulfur dioxide in the region. 



Li section 2.3, the cross-validatory analysis for sulfur dioxide using weighted 
loss functions suggests a = 2 km while the unweighted analysis leans toward a do = 
50 km. This implies that sulfur dioxide is more locally heterogeneous than suspended 
particulate. Minimum mean local variance analysis shows this is tme. Table 3. gives a 
minimum at a circle radius of 10.74 km wliich corresponds to do — 5 km. Notice that this 
solution creates no conflict of choice between 2 do s which are so far apart. The 
minimum radius chosen will be at least near the global minimum. 

In both figures 2. and 3., MLV increases after a minimum area and continues 
to increase steadily to the global sample variance of the entire region. Note the slight 
instability of the curve at smaller circular window size. Tliis instability is expecteil since 
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with small ki in each window we expect MLV to be extremely sensitive to the addition 
of new points with slight increments in window size. 

3.2 The Shape of Mean Local Variance 

Mean Local Variance was calculated for sulfur dioxide and suspended parti- 
culate in the states of New York and Florida. The results are revealed in this section. 
New York and Florida were chosen because, out of all states in the United States, these 
states have the second and third largest number monitoring stations respectively. For 
each state, the basic shape of the curve showing mean local variance against circular win- 
dow size is similar. Sulfur dioxide is consistently more locally heterogeneous than 
suspended particulate across all three states. (Tables 3., 4. and 5., along with Figures 2. 
thru?.) 



TABLE 4. Mean Local Variance (MLV) as a function of do for Suspended Particulate 
and Sulfur Dioxide in New York State. 







Suspended Particulate 


Sulfur Dioxide 


do (km) 


radius (km) 


MLV 


N-% — 


MLV 




1 


3.03 


.0484 


75 69 


.2023 


40.18 


1 


4.00 


.0480 


92.47 


.1851 


48.51 


2 


4.29 


.0493 


98.27 


.1845 


51.06 


2 


8.00 


*.0481 


132.76 


.1615 


71.95 


5 


10.74 


.0516 


154.79 


*.1614 


83.07 


5 


20.00 


.0608 


202.26 


.2080 


103.33 


10 


21.47 


.0631 


208.31 


.2079 


104.21 


10 


40.00 


.0808 


245.42 


.2588 


114.86 


20 


42.94 


.0837 


248.50 


.2608 


115.95 


20 


80.00 


.0974 


265.88 


.3085 


124.62 


50 


107.36 


.1033 


269.99 


.3537 


128.44 


50 


200.00 


.1107 


273.30 


.3739 


131.49 


** 


700.00 


.1168 


275.00 


.3839 


133.00 
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TABLE 5. Mean Local Variance as a function of dg for Suspended Particulate and Sul- 
fur Dioxide in Florida State. 



do 


radius (km) 


Suspended Particulate 


Sulfur Dioxide 


MLV 


1 

t 

1 


MLV 


N-% — 


1 


3.03 


.0580 


48.73 


.2365 


46.82 


1 


4.00 


*.0546 


69.20 


.2308 


59.74 


2 


4.29 


.0563 


72.56 


.2298 


62.00 


2 


8.00 


.0574 


105.35 


*.2176 


81.64 


5 


10.74 


.0608 


122.99 


.2325 


90.68 


5 


20.00 


.0654 


151.83 


.2426 


105.83 


10 


21.47 


.0668 


155.72 


.2423 


107.50 


10 


40.00 


.0716 


176.02 


.3138 


118.06 


20 


42.94 


.0716 


177.97 


.3151 


118.75 


20 


80.00 


.0841 


191.81 


.3222 


127.46 


50 


107.36 


.0866 


194.26 


.3452 


131.39 


50 


200.00 


.0918 


200.73 


.3595 


134.17 


** 


500.00 


.0907 


201.00 


.4017 


184.00 



* The smallest mean local variance, corresponding to the optimal d„ . 

** dg is that value which gives a radius large enough to contain all the points in the 
region. 



States and pollutants have different window sizes that give minimum mean 
local variance. Since there are different dispersion and dilution mechanisms in different 
geographic areas, in addition to the different spatial distribution of the samples we should 
expect different choices across states and pollutants. Tire table below shows the radii at 
which the minima occur among states and pollutants. 



Radius (km) Radius (km) 

Suspended Particulate Sulfur Dioxide 

Ohio 4.29 10.74 

New York 8.00 10.74 

Florida 4.00 8.00 

In Figures 4. through 7., the line graphs are drawn to see the shape of MLV 
and there are basically three regions on each curve. The first region is called the 
Undefined Region. This area ranges from radius of 0 to a radius large enough to have at 
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least one circular window containing two points. The size of the undefined region is 
determined by the particular geographic area one is studying and the spatial distribution 
of stations in that area. The second region is called the Region of No-Precision. The 
Undefined Region is included in the Region of No-Precision. This region corresponds to 
circular windows which are too small to have enough points to measure local variance. 
This area is not specifically defined but can be estimated by looking at the "degrees of 
freedom" and plots of the logarithm of the radius vs. MLV. The "degrees of freedom" 
specify the amount of overlap and therefore how many points lie in each other’s 
corresponding windows. The Region of No-Precision is designated in each of the plots, 
yet the boundary of this region is not shaip. The third region is for radii greater than 
those in the no-precision region. Window sizes in this region are large enough to contain 
enough points so that local variance can be measured. It is in tliis area that the optimum 
choice for d^, exists. 

For the sake of example, plots of the "degrees of freedom (df)" for Ohio 
state’s suspended particulate against logarithm of the window radius are given in Figure 
8. Note that "df' increases quickly when radii are small and then levels out as they get 
large enough to include almost every station in the area. Degrees of freedom are a mono- 
tonic function of window size. The plot of "df versus the log of the radius is an S- 
shaped curve. It is within the steep part of the S that the minimum MLV is likely to 
occur. 

It is relatively easy to choose the optimum circle radius if MLV has the func- 
tional behavior shown in most of these curves. In Figure 4. it is more difficult because 
MLV does not increase steadUy but bounces around initially. Using the method 
described in the previous paragraph, we would choose the minimum, which in this case. 



lies in the area of no-precision, since it is the first trough-like area after a rapid increase. 
This window radius is only one kilometer. Knowing sometliing about the spatial distri- 
bution of monitoring stations, it is seen that one kilometer would not allow many stations 
to have enough points contained in their windows to measure local variance. With this 
choice, the area of no-precision would have to end at a radius of .5 kilometers 

corresponding to an MLV with 8 "degrees of freedom". This implies that = 268. 

Since there are only 276 stations altogether, /i, must be 1 for most stations. Tlius at a cir- 
cle radius of .5 kilometers, there are an insufficient number of points to measure mean 
local variance and thus it should be considered imprecise at this point. Tlierefore, care 
must be taken in the choice of local minimum. We must closely examine the degrees of 
freedom to see how reliable MLV is at various circular window sizes before choosing the 
minimum. 

Perhaps there are two ways one can use mean local variance to choose the 
"best" circular window radius corresponding to the "best" cIq. One method is to choose a 
point estimate, the minimum radius greater than the "no-precision" area’s upper limit. 
This method might not be the best since the MLV curve is data dependent, and therefore 
fluctuates around the minimum making it difficult to determine. Another involves choos- 
ing a region of radii called the "Region of Homogeneity". In each of the curves of MLV, 
after the region of no-precision there is a trough-like area in which the minimum occurs 
and then MLV starts to increase rapidly. We could call this trough-like area the Region 
of Homogeneity. Choosing a larger radius allows us to use more points for reliability. 
Yet, we do not want to use insignificant points. Tlius choosing the largest possible radius 
before MLV starts increasing rapidly towards the global sample variance may be another 
acceptable method. In other words, one should not take any points into tlie estimate 
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which are outside the area of homogeneity. Wliich point is chosen as giving the most 
homogeneous window size is not critical for estimation. However, it is important to 
know which window sizes not to use, and those sizes can be determined easily by investi- 
gating MLV and its relationship with circular window size. 

4. Conclusions 

A non-parametric regression method has been used to estimate pollution 
concentrations for particular geographic areas. To determine the •'^.'t value of the 
smoothing parameter d^, cross-validation techniques were used. Cross-validation has 
been used as a method for finding the appropriate level of smoothing by the work of 
Wahba (1975) and others. Use of the concept of mean local variance (MLV) is explored 
as an alternative means of finding d^, . An example of the MLV approach has been given 
here using these data. Further work would include simulations of Z (x ,y ) for finding the 
distributional properties of mean local variance. 

Minimum mean local variance analysis could also be used for choosing 
parameters in other methods of spatial estimation. For example, mean local variance can 
be plotted as a function of the number of nearest neighbors or the size of the partition and 
thus the minimum mean local variance could correspond to the optimum number of 
neighbors or partition size in the same way it does d^ . 

Mean Local Variance is also of interest in its own right. MLV can be useful 
for comparing geographic regions, spatial sampling distributions and different pollutant’s 
dispersion mechanisms. 
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FIGURE 1 . Weight as a function of distance from estimated point. 
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FIGURE 2. Mean Local Variance vs. Logarithm of Radius 
For Suspended Particulate in Ohio. 
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FIGURE 4. Mean Local Variance vs Log of Radius 
for Suspended Particulate in NY state. 
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FIGURE 5. Mean Local Variance vs Log of Radius 
For Sulfur Dioxide in NY State. 



Log(Radius of Circular Window) 



MLV 




FIGURE 6. Mean Local Variance vs Log of Radius 
For Suspended Particulate in Florida State. 
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FIGURE 7. Mean Local Variance vs Log of Radius 
For Sulfur Dioxide in Florida State. 
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FIGURE 8. Degrees of Freedom vs Logarithm of Circle Radius 
For Suspended Particulate in Ohio. 
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